{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's start by reading in the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/Users/briand/Desktop/ds course/ipython/data/\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import os\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "\n",
    "#We assume data is in a parallel directory to this one called 'data'\n",
    "cwd = os.getcwd()\n",
    "datadir = '/'.join(cwd.split('/')[0:-1]) + '/data/'\n",
    "print(datadir)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#Student put in read data command here:\n",
    "data = pd.read_csv(datadir + 'survey_responses_2019.csv', header = 0, sep=',')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the column headers and use something more descriptive"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['id', 'cs_python', 'cs_java', 'cs_c', 'cs_perl', 'cs_javascript',\n",
       "       'cs_r', 'cs_sas', 'profile_1', 'profile_2', 'profile_3', 'profile_4',\n",
       "       'profile_5', 'profile_6', 'profile_7', 'len_answer', 'experience_coded',\n",
       "       'experience'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Student put in code to look at column names\n",
    "data.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Column names like 'profile_1-profile_7' aren't very descriptive. As a quick data maintenance task, let's rename the columns starting with 'profile'. The dictionary in the next cell maps the integer index to a descriptive text.\n",
    "\n",
    "Tactically, let's loop through each column name. Within the loop let's check whether the column name starts with 'profile.' If it does, let's create a new name that swaps the key with the value using profile_mapping dictionary (i.e., profile_1 -> profile_Viz). We then add the new column name to a list. If it doesn't start with 'profile' just add the old column name to the list. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "profile_mapping = {1:'Viz',\n",
    "                   2:'CS',\n",
    "                   3:'Math',\n",
    "                   4:'Stats',\n",
    "                   5:'ML',\n",
    "                   6:'Bus',\n",
    "                   7:'Com'}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#Student put code here to change the header names\n",
    "newcols = []\n",
    "\n",
    "for colname in data.columns:\n",
    "    \n",
    "    if colname[0:7] == 'profile':\n",
    "        \n",
    "        newcols.append('profile_{}'.format(profile_mapping[int(colname[-1])]))\n",
    "        \n",
    "    else:\n",
    "        \n",
    "        newcols.append(colname)\n",
    "    \n",
    "#Now swap the old columns with the values in newcols    \n",
    "data.columns = newcols    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's use this data to illustrate common data analytic techniques. We have one numeric variable (len_answer) and different categorical variables which may carry some signal of the 'len_answer' variable. \n",
    "\n",
    "'Len_answer' is the character count of the response to the following question: \"Besides the examples given in lecture 1, discuss a case where data science has created value for some company. Please explain the company's goals and how any sort of data analysis could have helped the company achieve said goals.\" As this is a subjective business question, let's hypothesize that students with more professional experience might be more likely to give longer answers. \n",
    "\n",
    "In more technical terms, we'll test whether the variance of len_answer can be explained away by the categorical representation of a student's experience. \n",
    "\n",
    "The first thing we should do is look at the distribution of len_answer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(array([16., 47., 46., 28.,  6.,  7.,  7.,  3.,  2.,  1.]),\n",
       " array([   0. ,  190.1,  380.2,  570.3,  760.4,  950.5, 1140.6, 1330.7,\n",
       "        1520.8, 1710.9, 1901. ]),\n",
       " <a list of 10 Patch objects>)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAD8CAYAAABn919SAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4zLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvIxREBQAADdpJREFUeJzt3W2s5OVdxvHvJQvV9EGWckI2gB5q\nsYY3Am4Ipg8voKU8VEBtGkhjVyXZmLQJpJq6lcTUxBegsVWjsVmFdGuw0MdASk2LSG1MhHqgy+OW\n7oJLhCzstoC00VRpf76Ye+2w7mHmnJk5c87N95NMzn/u+c/Mde7Zc53/uedhU1VIkja+H5t3AEnS\ndFjoktQJC12SOmGhS1InLHRJ6oSFLkmdsNAlqRMWuiR1wkKXpE5sWss7O/HEE2txcXEt71KSNrx7\n773321W1MGq/NS30xcVFlpaW1vIuJWnDS/LEOPu55CJJnbDQJakTFrokdcJCl6ROWOiS1AkLXZI6\nYaFLUicsdEnqhIUuSZ1Y03eKbkSLO26f233vv+6Sud23pI3HI3RJ6oSFLkmdsNAlqROuoa9j81q/\nd+1e2pg8QpekTljoktQJC12SOmGhS1InLHRJ6oSFLkmdsNAlqRMWuiR1wkKXpE5Y6JLUCQtdkjph\noUtSJyx0SeqEhS5JnbDQJakTFrokdcJCl6ROWOiS1AkLXZI6YaFLUicsdEnqhIUuSZ0Yu9CTHJPk\nG0m+2M6fluSeJPuS3JLkuNnFlCSNspIj9KuBPUPnrwc+VlVvBJ4DrppmMEnSyoxV6ElOAS4B/qad\nD3Ae8Nm2yy7g8lkElCSNZ9wj9D8FPgT8sJ1/PfB8Vb3Yzj8JnDzlbJKkFRhZ6EneBRysqntXcwdJ\ntidZSrJ06NCh1dyEJGkM4xyhvxm4NMl+4GYGSy1/BhyfZFPb5xTgqaNduap2VtXWqtq6sLAwhciS\npKMZWehV9eGqOqWqFoErgH+sqvcCdwHvbrttA26dWUpJ0kiTvA79d4EPJtnHYE39hulEkiStxqbR\nu/xIVX0V+Grbfhw4Z/qRJEmr4TtFJakTFrokdcJCl6ROWOiS1AkLXZI6YaFLUicsdEnqhIUuSZ2w\n0CWpExa6JHXCQpekTljoktQJC12SOmGhS1InLHRJ6oSFLkmdsNAlqRMWuiR1wkKXpE5Y6JLUCQtd\nkjphoUtSJyx0SeqEhS5JnbDQJakTFrokdcJCl6ROWOiS1AkLXZI6YaFLUicsdEnqhIUuSZ2w0CWp\nExa6JHXCQpekTows9CQ/nuTrSe5P8nCSP2jjpyW5J8m+JLckOW72cSVJyxnnCP37wHlV9fPAmcCF\nSc4Frgc+VlVvBJ4DrppdTEnSKCMLvQa+184e204FnAd8to3vAi6fSUJJ0ljGWkNPckyS3cBB4A7g\nMeD5qnqx7fIkcPJsIkqSxjFWoVfVD6rqTOAU4Bzg58a9gyTbkywlWTp06NAqY0qSRlnRq1yq6nng\nLuAXgeOTbGoXnQI8tcx1dlbV1qraurCwMFFYSdLyxnmVy0KS49v2TwDvAPYwKPZ3t922AbfOKqQk\nabRNo3dhC7AryTEMfgF8uqq+mOQR4OYkfwh8A7hhhjklSSOMLPSqegA46yjjjzNYT5ckrQO+U1SS\nOmGhS1InLHRJ6sQ4T4rqFWZxx+1zu+/9110yt/uWNjqP0CWpExa6JHXCQpekTljoktQJC12SOmGh\nS1InLHRJ6oSFLkmdsNAlqRMWuiR1wkKXpE5Y6JLUCQtdkjphoUtSJyx0SeqEhS5JnbDQJakTFrok\ndcJCl6ROWOiS1AkLXZI6YaFLUicsdEnqhIUuSZ2w0CWpExa6JHXCQpekTljoktQJC12SOmGhS1In\nLHRJ6sTIQk9yapK7kjyS5OEkV7fxE5LckWRv+7p59nElScsZ5wj9ReC3q+oM4Fzg/UnOAHYAd1bV\n6cCd7bwkaU5GFnpVHaiq+9r2d4E9wMnAZcCuttsu4PJZhZQkjbaiNfQki8BZwD3ASVV1oF30NHDS\nVJNJklZk7EJP8hrgc8A1VfXC8GVVVUAtc73tSZaSLB06dGiisJKk5Y1V6EmOZVDmN1XV59vwM0m2\ntMu3AAePdt2q2llVW6tq68LCwjQyS5KOYpxXuQS4AdhTVR8duug2YFvb3gbcOv14kqRxbRpjnzcD\nvwY8mGR3G/s94Drg00muAp4A3jObiJKkcYws9Kr6ZyDLXHz+dONIklbLd4pKUicsdEnqhIUuSZ2w\n0CWpExa6JHXCQpekTljoktQJC12SOmGhS1InLHRJ6oSFLkmdsNAlqRPjfNriurC44/Z5R5Ckdc0j\ndEnqhIUuSZ2w0CWpExa6JHXCQpekTljoktQJC12SOmGhS1InLHRJ6oSFLkmdsNAlqRMWuiR1wkKX\npE5Y6JLUCQtdkjphoUtSJyx0SeqEhS5JnbDQJakTFrokdcJCl6ROWOiS1ImRhZ7kxiQHkzw0NHZC\nkjuS7G1fN882piRplHGO0D8BXHjE2A7gzqo6HbiznZckzdHIQq+qrwHPHjF8GbCrbe8CLp9yLknS\nCq12Df2kqjrQtp8GTppSHknSKk38pGhVFVDLXZ5ke5KlJEuHDh2a9O4kSctYbaE/k2QLQPt6cLkd\nq2pnVW2tqq0LCwurvDtJ0iirLfTbgG1textw63TiSJJWa5yXLX4K+BfgTUmeTHIVcB3wjiR7gbe3\n85KkOdo0aoequnKZi86fchZJ0gR8p6gkdcJCl6ROWOiS1AkLXZI6YaFLUicsdEnqhIUuSZ2w0CWp\nEyPfWCS9EizuuH3eEdbc/usumXcETZlH6JLUCQtdkjphoUtSJ1xD17rySlzLlqbFI3RJ6oSFLkmd\nsNAlqRMWuiR1wkKXpE5Y6JLUCQtdkjphoUtSJyx0SeqEhS5JnbDQJakTFrokdcJCl6RO+GmL0ivU\nPD/Z0v8taTY8QpekTljoktQJC12SOuEauqQ1N6/1+97X7j1Cl6ROWOiS1AkLXZI64Rq6pFeM3tfu\nJzpCT3JhkkeT7EuyY1qhJEkrt+pCT3IM8JfARcAZwJVJzphWMEnSykxyhH4OsK+qHq+q/wZuBi6b\nTixJ0kpNUugnA/8+dP7JNiZJmoOZPymaZDuwvZ39XpJHV3lTJwLfnk6qmTDfZMw3GfNNZqb5cv3E\nN/HT4+w0SaE/BZw6dP6UNvYSVbUT2DnB/QCQZKmqtk56O7NivsmYbzLmm8x6zzeuSZZc/hU4Pclp\nSY4DrgBum04sSdJKrfoIvapeTPIB4MvAMcCNVfXw1JJJklZkojX0qvoS8KUpZRll4mWbGTPfZMw3\nGfNNZr3nG0uqat4ZJElT4Ge5SFInNkShz/sjBpKcmuSuJI8keTjJ1W38I0meSrK7nS4eus6HW95H\nk7xzDTLuT/Jgy7HUxk5IckeSve3r5jaeJH/e8j2Q5OwZZ3vT0BztTvJCkmvmPX9JbkxyMMlDQ2Mr\nnrMk29r+e5Nsm3G+P07yzZbhC0mOb+OLSf5raC4/PnSdX2j/Nva17yEzzLfix3RWP9/L5LtlKNv+\nJLvb+JrP30xU1bo+MXjC9THgDcBxwP3AGWucYQtwdtt+LfAtBh938BHgd46y/xkt56uA01r+Y2ac\ncT9w4hFjfwTsaNs7gOvb9sXA3wMBzgXuWePH82kGr6ud6/wBbwPOBh5a7ZwBJwCPt6+b2/bmGea7\nANjUtq8fyrc4vN8Rt/P1ljnte7hohvlW9JjO8uf7aPmOuPxPgN+f1/zN4rQRjtDn/hEDVXWgqu5r\n298F9vDy74q9DLi5qr5fVf8G7GPwfay1y4BdbXsXcPnQ+Cdr4G7g+CRb1ijT+cBjVfXEy+yzJvNX\nVV8Dnj3Kfa9kzt4J3FFVz1bVc8AdwIWzyldVX6mqF9vZuxm8/2NZLePrquruGrTTJ4e+p6nnexnL\nPaYz+/l+uXztKPs9wKde7jZmOX+zsBEKfV19xECSReAs4J429IH25++Nh/88Zz6ZC/hKknszeHcu\nwElVdaBtPw2cNMd8h13BS3+I1sv8HbbSOZtn1t9kcMR42GlJvpHkn5K8tY2d3DKtZb6VPKbzmr+3\nAs9U1d6hsfUyf6u2EQp93UjyGuBzwDVV9QLwV8DPAGcCBxj8CTcvb6mqsxl8+uX7k7xt+MJ2dDHX\nlzRl8Aa0S4HPtKH1NH//z3qYs+UkuRZ4EbipDR0AfqqqzgI+CPxdktfNIdq6fkyHXMlLDyzWy/xN\nZCMU+lgfMTBrSY5lUOY3VdXnAarqmar6QVX9EPhrfrQssOaZq+qp9vUg8IWW5ZnDSynt68F55Wsu\nAu6rqmda1nUzf0NWOmdrnjXJrwPvAt7bfunQljK+07bvZbAu/bMty/CyzEzzreIxncf8bQJ+Bbhl\nKPe6mL9JbYRCn/tHDLT1thuAPVX10aHx4XXnXwYOP5t+G3BFklclOQ04ncETK7PK9+okrz28zeCJ\ns4dajsOvutgG3DqU733tlRvnAv8xtMwwSy85Klov83eElc7Zl4ELkmxuywsXtLGZSHIh8CHg0qr6\nz6HxhQz+jwKSvIHBnD3eMr6Q5Nz27/h9Q9/TLPKt9DGdx8/324FvVtX/LaWsl/mb2LyflR3nxOAV\nBt9i8Fvz2jnc/1sY/On9ALC7nS4G/hZ4sI3fBmwZus61Le+jzPhZcQavELi/nR4+PEfA64E7gb3A\nPwAntPEw+M9JHmv5t67BHL4a+A7wk0Njc50/Br9cDgD/w2Bt9KrVzBmDtex97fQbM863j8Ga8+F/\nhx9v+/5qe+x3A/cBvzR0O1sZFOtjwF/Q3lA4o3wrfkxn9fN9tHxt/BPAbx2x75rP3yxOvlNUkjqx\nEZZcJEljsNAlqRMWuiR1wkKXpE5Y6JLUCQtdkjphoUtSJyx0SerE/wKsFSIrm7oaogAAAABJRU5E\nrkJggg==\n",
      "text/plain": [
       "<Figure size 600x400 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "#Student - build and plot a histogram here\n",
    "plt.hist(data.len_answer)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It looks like we may have at least one strong outlier and somewhat of a log-normal distribution. Let's also use the Pandas describe() method to get a stronger sense of the distribution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count     163.000000\n",
       "mean      523.478528\n",
       "std       348.918087\n",
       "min         0.000000\n",
       "25%       281.000000\n",
       "50%       471.000000\n",
       "75%       648.000000\n",
       "max      1901.000000\n",
       "Name: len_answer, dtype: float64"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.len_answer.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's consider cleaning up the data. We'll remove the top k values as well as those with a length less than 50 (which we think is a generous minimum to communicate a reasonable answer.\n",
    "\n",
    "Create a new data_frame that removes these outliers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((155, 18), (163, 18))"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Write a function to get the kth largest value of an array\n",
    "def get_kth_largest(inarray, k):\n",
    "    inarray.sort()\n",
    "    return inarray[-k]\n",
    "\n",
    "k = 3\n",
    "kth_largest = get_kth_largest(data.len_answer.values, 3)\n",
    "#Question = why did we wrap the series into an np.array() call in the above function call?\n",
    "\n",
    "#Student create a filtered data frame here\n",
    "outlier_filter = (data.len_answer > 50) & (data.len_answer < kth_largest)\n",
    "data_clean = data[outlier_filter]\n",
    "\n",
    "#Compare the shape of both dataframes\n",
    "data_clean.shape, data.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have cleaned our data, let's run a pairwise t-test on each experience level to see if their difference in len_answer is statistically significant. To run a t-test, we'll need the mean, standard-deviation and count for each group. We can achieve this with a pandas groupby operation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr:last-of-type th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th colspan=\"3\" halign=\"left\">len_answer</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th>mean</th>\n",
       "      <th>std</th>\n",
       "      <th>count</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>experience</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2-5 years, I'm getting good at what I do!</th>\n",
       "      <td>601.965517</td>\n",
       "      <td>339.538498</td>\n",
       "      <td>29</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5+ years, I'm a veteran!</th>\n",
       "      <td>733.153846</td>\n",
       "      <td>403.083913</td>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>&lt; 2 years, I'm fresh!</th>\n",
       "      <td>503.395349</td>\n",
       "      <td>284.094396</td>\n",
       "      <td>43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>None, I just finished my undergrad!</th>\n",
       "      <td>449.700000</td>\n",
       "      <td>249.228278</td>\n",
       "      <td>70</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                           len_answer                  \n",
       "                                                 mean         std count\n",
       "experience                                                             \n",
       "2-5 years, I'm getting good at what I do!  601.965517  339.538498    29\n",
       "5+ years, I'm a veteran!                   733.153846  403.083913    13\n",
       "< 2 years, I'm fresh!                      503.395349  284.094396    43\n",
       "None, I just finished my undergrad!        449.700000  249.228278    70"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Student input code here\n",
    "data_clean_grouped = data_clean[['len_answer', 'experience']].groupby(['experience']).agg(['mean', 'std', 'count'])\n",
    "data_clean_grouped"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Visually, we can see a potential split between the [0, 2] year experience range and the [2+] experience range. Let's be more rigorous and run t-tests. Let's write a function that takes in the necessary statistics and returns a p-value.\n",
    "\n",
    "Remember, the t-stat for the difference between two means is:\n",
    "\n",
    "<center>$t = \\frac{\\hat{\\mu_1} - \\hat{\\mu_2}}{\\sqrt{\\frac{\\hat{\\sigma_1}^2}{n_1} + \\frac{\\hat{\\sigma_2}^2}{n_2}}}$</center>\n",
    "\n",
    "The p-value can be found using a t-distribution, but for simplicity, let's approximate this with the normal distribution. For the 2-tailed test, the p-value is: 2 * (1 - Norm.CDF(T))."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#Student complete the function\n",
    "from scipy.stats import norm\n",
    "\n",
    "def pvalue_diffmeans_twotail(mu1, sig1, n1, mu2, sig2, n2):\n",
    "    '''\n",
    "    P-value calculator for the hypothesis test of mu1 != mu2.\n",
    "    Takes in the approprate inputs to compute the t-statistic for the difference between means\n",
    "    Outputs a p-value for a two-sample t-test.\n",
    "    '''\n",
    "    diff = mu1 - mu2\n",
    "    stderror = np.sqrt(sig1**2 / n1 + sig2**2 / n2)\n",
    "    t = diff / stderror\n",
    "    \n",
    "    p_value = 2 * (1- norm.cdf(np.abs(t)))\n",
    "    \n",
    "    return (t, p_value)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now loop through all possible pairs in data_clean_grouped and perform a t-test."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Two tailed T-Test between groups: 2-5 years, I'm getting good at what I do! and 5+ years, I'm a veteran!\n",
      "Diff = -131.0 characters\n",
      "The t-stat is -1.022 and p-value is 0.307\n",
      "\n",
      "Two tailed T-Test between groups: 2-5 years, I'm getting good at what I do! and < 2 years, I'm fresh!\n",
      "Diff = 99.0 characters\n",
      "The t-stat is 1.288 and p-value is 0.198\n",
      "\n",
      "Two tailed T-Test between groups: 2-5 years, I'm getting good at what I do! and None, I just finished my undergrad!\n",
      "Diff = 152.0 characters\n",
      "The t-stat is 2.184 and p-value is 0.029\n",
      "\n",
      "Two tailed T-Test between groups: 5+ years, I'm a veteran! and < 2 years, I'm fresh!\n",
      "Diff = 230.0 characters\n",
      "The t-stat is 1.916 and p-value is 0.055\n",
      "\n",
      "Two tailed T-Test between groups: 5+ years, I'm a veteran! and None, I just finished my undergrad!\n",
      "Diff = 283.0 characters\n",
      "The t-stat is 2.45 and p-value is 0.014\n",
      "\n",
      "Two tailed T-Test between groups: < 2 years, I'm fresh! and None, I just finished my undergrad!\n",
      "Diff = 54.0 characters\n",
      "The t-stat is 1.021 and p-value is 0.307\n",
      "\n"
     ]
    }
   ],
   "source": [
    "#Student put in code here:\n",
    "\n",
    "#get distinct values in the data frame for the experience variable\n",
    "\n",
    "#data_grouped = data[['len_answer', 'experience']].groupby(['experience']).agg(['mean', 'std', 'count'])\n",
    "#ttest_data = data_grouped\n",
    "\n",
    "\n",
    "ttest_data = data_clean_grouped\n",
    "\n",
    "\n",
    "grps = ttest_data.index.values\n",
    "\n",
    "#Now loop through each pair\n",
    "for i, grp1 in enumerate(grps):\n",
    "    for grp2 in grps[i + 1:]:\n",
    "    \n",
    "        '''\n",
    "        hint: since the grp name is the index, pull out the record corresponding to that index value. \n",
    "        Also, the result of groupby uses a multi-index. So be sure to index on 'len_answer' as well.\n",
    "        Then pull out the mean, std, and cnt from that result.   \n",
    "        '''        \n",
    "        row1 = ttest_data.loc[grp1].loc['len_answer']\n",
    "        row2 = ttest_data.loc[grp2].loc['len_answer']\n",
    "    \n",
    "        tstat, p_value = pvalue_diffmeans_twotail(row1['mean'], row1['std'], row1['count'], row2['mean'], row2['std'], row2['count'])\n",
    "    \n",
    "        print('Two tailed T-Test between groups: {} and {}'.format(grp1, grp2))\n",
    "        print('Diff = {} characters'.format(round(row1['mean'] - row2['mean'], 0)))\n",
    "        print('The t-stat is {} and p-value is {}'.format(round(tstat, 3), round(p_value, 3)))\n",
    "        print('')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What are some observations you might have about the above results? Are there any with large deviances that are not statistically significant at at least a 95% level? Is there any issue with using 95% as our threshold for statistical significance? In fact there is. We are running multiple hypothesis tests at once, and doing this is known to increase the probability that we have at least one false positive (i.e., $P(False Positive) = 1 - .95^{Ntests}$). We can apply a simplye but conservative method called the <a href=\"https://en.wikipedia.org/wiki/Bonferroni_correction\">Bonferoni Correction</a>, which says that if we normally would care about an alpha level of $\\alpha$ for significance testing, and we're doing $N$ tests, then our new significance level should be $\\alpha/N$. This correction is conservative because it assumes that each test is independent. Since each group is repeatedly sampled across pairs, we know that our individual tests are not indeed independent. Nonetheless, we'll see how the results hold under this new regime. \n",
    "\n",
    "Also, how do the numbers change if you rerun it using the original data, and not the cleaned data. What is the effect of outliers on the results?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Two tailed T-Test between groups: 2-5 years, I'm getting good at what I do! and 5+ years, I'm a veteran!\n",
      "Diff = -75.0 characters\n",
      "The t-stat is -0.548 and p-value is 0.584\n",
      "\n",
      "Two tailed T-Test between groups: 2-5 years, I'm getting good at what I do! and < 2 years, I'm fresh!\n",
      "Diff = 176.0 characters\n",
      "The t-stat is 1.925 and p-value is 0.054\n",
      "\n",
      "Two tailed T-Test between groups: 2-5 years, I'm getting good at what I do! and None, I just finished my undergrad!\n",
      "Diff = 205.0 characters\n",
      "The t-stat is 2.357 and p-value is 0.018\n",
      "\n",
      "Two tailed T-Test between groups: 5+ years, I'm a veteran! and < 2 years, I'm fresh!\n",
      "Diff = 251.0 characters\n",
      "The t-stat is 2.092 and p-value is 0.036\n",
      "\n",
      "Two tailed T-Test between groups: 5+ years, I'm a veteran! and None, I just finished my undergrad!\n",
      "Diff = 280.0 characters\n",
      "The t-stat is 2.4 and p-value is 0.016\n",
      "\n",
      "Two tailed T-Test between groups: < 2 years, I'm fresh! and None, I just finished my undergrad!\n",
      "Diff = 29.0 characters\n",
      "The t-stat is 0.522 and p-value is 0.602\n",
      "\n"
     ]
    }
   ],
   "source": [
    "#Rerun everything without cleaning outliers\n",
    "data_grouped = data[['len_answer', 'experience']].groupby(['experience']).agg(['mean', 'std', 'count'])\n",
    "ttest_data = data_grouped\n",
    "\n",
    "\n",
    "grps = ttest_data.index.values\n",
    "\n",
    "#Now loop through each pair\n",
    "for i, grp1 in enumerate(grps):\n",
    "    for grp2 in grps[i + 1:]:\n",
    "    \n",
    "        '''\n",
    "        hint: since the grp name is the index, pull out the record corresponding to that index value. \n",
    "        Also, the result of groupby uses a multi-index. So be sure to index on 'len_answer' as well.\n",
    "        Then pull out the mean, std, and cnt from that result.   \n",
    "        '''        \n",
    "        row1 = ttest_data.loc[grp1].loc['len_answer']\n",
    "        row2 = ttest_data.loc[grp2].loc['len_answer']\n",
    "    \n",
    "        tstat, p_value = pvalue_diffmeans_twotail(row1['mean'], row1['std'], row1['count'], row2['mean'], row2['std'], row2['count'])\n",
    "    \n",
    "        print('Two tailed T-Test between groups: {} and {}'.format(grp1, grp2))\n",
    "        print('Diff = {} characters'.format(round(row1['mean'] - row2['mean'], 0)))\n",
    "        print('The t-stat is {} and p-value is {}'.format(round(tstat, 3), round(p_value, 3)))\n",
    "        print('')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [py35]",
   "language": "python",
   "name": "Python [py35]"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
